ABSTRACT

Since GNU/Linux became a popular operating system on
computer network routers, its packet routing mechanisms
have attracted more interest. This concerns not only “big”
Linux servers acting as routers but increasingly also small
and medium network access devices, such as DSL or cable
access devices.

Although many documents deal with high-performance
routing with GNU/Linux, only a few offer experimental
results to support the advice they give. This study
evaluates the throughput performance of Linux's routing
subsystem netfilter under various conditions, such as
different transport protocols in combination with different
IP address families and transmission strategies. These
conditions were evaluated with two different types of
netfilter rules and a high number of rules in the rule
tables. In addition, our experiments allowed us to evaluate
two prominent client connection handling techniques
(threads and the epoll() facility).

The evaluation of the 1,260 different combinations of our
test parameters shows a nearly linear but small throughput
loss with the number of rules, which is independent of the
transport protocol and frame size. However, the evaluation
also identifies a difference in throughput loss between the
address families, i.e. IPv4 and IPv6.

Categories and Subject Descriptors 

C.2.6 [Internetworking]: Routers; C.4 [Performance
of Systems]: Measurement techniques

Keywords 

Linux, netfilter, performance 

1. INTRODUCTION 

One of the benefits of the Linux kernel is its availability
for nearly every hardware architecture. In combination
with the GNU operating system (often referred to as
GNU/Linux or simply Linux), it is a good choice for
routers in computer networks because the modularity of
the kernel keeps its memory footprint small.


In addition, the Linux kernel has out-of-the-box routing
capabilities as well as advanced packet filtering and
transformation mechanisms, which are provided by the
netfilter framework inside the Linux kernel. Quality of
service based classification and prioritization are
available, too.

Along with other key features such as the large variety
of server software, this makes GNU/Linux one of the most
preferred operating systems, especially for small routing
devices such as DSL or cable access devices in end-user
environments. Another well-known example is its use in
wireless access routers in the form of DD-WRT.

Those devices as well as “big” GNU/Linux routers, for
example PCs or servers, share the same disadvantage:
routing and filtering are done in software whose
execution time is influenced by many factors, for example
CPU, main memory and hardware drivers. In the worst case,
the hardware of the router is not fast enough to process
the data packets, and they are delayed or discarded.

The main objective of this study is to evaluate the
impact of netfilter rules on the throughput rate per client
in a distributed client-server application, i.e. a
performance test. Although we are aware that netfilter
features a variety of filter rules, our experiments focus
on the rules most interesting for router operators: rules
for permitting clients to pass the router (ACL) and for
measuring their traffic volume (known as IP accounting),
as well as rules for regulating the available network
bandwidth among those clients (known as QoS).

In section 2 we describe the reasons why we did not use
the test apparatus that RFC 3511 suggests for this kind
of performance test. Additionally, we specify our test
apparatus and the extended set of possible influence
parameters.

Since we used our own test apparatus, we were able to
clarify another aspect of client-server applications:
the handling of client connections in the server
component. We evaluated two widely used techniques in
terms of the throughput rate. The first is to handle each
client connection in its own thread (threading), and the
other is the epoll() facility offered by the Linux kernel,
which is claimed to be more performant and easier to
implement. Our results along with other observations are
discussed in section 3.

Figure 1: netfilter performance testing architecture

2. TEST APPARATUS 

The test apparatus follows the guidelines described in
RFC 3511. Basically, it is a client-server architecture
in which a central gateway filters and transforms the data
transmissions between the clients and the server.

Contrary to RFC 3511, we did not use the suggested HTTP
benchmark because we were interested in evaluating a larger
number of influence parameters than only HTTP transactions
per second. All test parameters that we were interested in
are listed in table 1. Instead, we developed a distributed
application^1 that has the same semantics as popular
command line benchmark tools like iperf or netperf but
incorporates a gateway as a third component in the
client-server concept.


Table 1: Parameters of a test case

Parameter  Description and tested values
---------  ------------------------------------------------
n          Number of client threads (5, 10, 20, 40, 80,
           160, 320)
t          Duration of the experiment (100 s)
f          Frame size (either fixed [64, 128, 256, 512,
           1024] or ranged between 64 and 1024)
P          Transport protocol for the transmission (either
           TCP, UDP or SCTP)
A          Address family (either IPv4 or IPv6)
T          Server component uses threads for handling the
           client connections (only valid for
           stream-oriented protocols)
F          netfilter rule generation per client thread:
           0 for plain forwarding, 2 for up- and download
           and 4 for additional QoS marks
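
The parameter set of table 1 can be written down as a small data structure. The following is only a minimal sketch: the class name ExperimentParams, its field names and the validate() method are hypothetical and are not taken from the authors' tool; only the tested values come from table 1.

```python
from dataclasses import dataclass

# Tested values as listed in table 1.
CLIENT_COUNTS = (5, 10, 20, 40, 80, 160, 320)
FRAME_SIZES = (64, 128, 256, 512, 1024, "ranged")
PROTOCOLS = ("TCP", "UDP", "SCTP")
ADDRESS_FAMILIES = ("IPv4", "IPv6")
RULES_PER_CLIENT = (0, 2, 4)


@dataclass(frozen=True)
class ExperimentParams:
    """Hypothetical container for one test case (names are assumptions)."""
    n: int            # number of client threads
    f: object         # frame size in bytes, or "ranged" (64 to 1024)
    p: str            # transport protocol
    a: str            # address family
    threaded: bool    # parameter T: one server thread per client
    rules: int        # parameter F: netfilter rules per client thread
    t: int = 100      # duration of the experiment in seconds

    def validate(self):
        """Raise ValueError if a value is not one of the tested values."""
        if self.n not in CLIENT_COUNTS:
            raise ValueError("unsupported number of client threads")
        if self.f not in FRAME_SIZES:
            raise ValueError("unsupported frame size")
        if self.p not in PROTOCOLS:
            raise ValueError("unsupported transport protocol")
        if self.a not in ADDRESS_FAMILIES:
            raise ValueError("unsupported address family")
        if self.rules not in RULES_PER_CLIENT:
            raise ValueError("F must be 0, 2 or 4")
        # T is only valid for stream-oriented protocols (TCP, SCTP).
        if self.threaded and self.p == "UDP":
            raise ValueError("T is only valid for stream-oriented protocols")
```

A call such as ExperimentParams(n=40, f=512, p="TCP", a="IPv6", threaded=True, rules=2).validate() passes, while a threaded UDP case is rejected.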


As shown in figure 1, our application consists of three
independent command line tools. The communication between
those three components can be divided into two parts:
a) the control connection between the client and the
gateway as well as the server component and b) the data
connections between the client threads and the server
thread(s). The control connection is used for sending the
test parameters to the components and (once they have
configured themselves according to the parameters) for
signaling the start and the end of the specific
experiment.

^1 (Link removed according to double-blind review process).

Every experiment follows the same steps: 

1. The test parameters are given as command line
parameters when the client component starts. The
parameters are validated and transferred to the gateway
component. The gateway component configures the netfilter
subsystem according to the submitted test parameters by
inserting appropriate filter rules as specified in test
parameter F.

2. When the gateway component signals its readiness, the
client component submits the test configuration to the
server component. The server component then awaits client
connections by opening a server socket. When this socket
is successfully opened and bound, the server component
signals its readiness back to the client component.

3. The client component initializes and executes n client
threads. They subsequently connect to the server
component. According to the parameter T, the server
component handles each client connection either a) in its
own thread or b) in a single thread using the epoll()
facility.

4. Once all client threads are connected, they begin to
send and to receive data packets according to the test
parameters P, A and f. Each of the n client-server
connections has its own independent sending/receiving
cycle as depicted in figure 2: the client sends a specific
amount of data (measurement point 1), the server thread
receives the data (measurement point 2) and echoes it back
to the client thread (measurement point 3). The client
thread finally receives the data (measurement point 4).

5. When the test duration t is reached, the client threads
get a signal to end the current sending/receiving cycle,
to disconnect from the server component and finally to
terminate. The client component instructs the server and
then the gateway component to restore the system state
that existed before the experiment.

For each of the four measurement points shown in figure 2,
several values were recorded. For measurement points 1 and
3, the number of successfully sent and unsent data frames
and the frame sizes were saved. The term “unsent” in this
context means that a data frame could not be sent
successfully within a specific timeout (500 ms).

Figure 2: Measurement points of the sending/receiving cycle

For measurement points 2 and 4, the number of received
data frames as well as the frame size and the result of
the validation were saved. Please note that a read timeout
was possible but not used. For the validation process,
every data frame sent by a client was filled with a data
record that contains the following information: a) the
number of the client in the range from 1 to n, b) the
chosen frame size according to test parameter f and c) a
consecutive sequence number starting with 1 and incremented
with every send/receive cycle. This makes it possible to
check whether a received data frame belongs to the
associated sender and whether the data was transmitted
completely, by comparing the received amount of data with
test parameter f. In addition, it makes it possible to
detect “gaps” in the sending/receiving cycle.
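
The validation record can be illustrated with a short sketch. The binary layout below (three unsigned 32-bit integers in network byte order, followed by zero padding up to the frame size) is an assumption made for illustration; the text only names the three fields, not their encoding. The function names build_frame and validate_frame are hypothetical.

```python
import struct

# Assumed layout: client number, frame size, sequence number.
HEADER = struct.Struct("!III")


def build_frame(client_no, frame_size, seq_no):
    """Return a frame of exactly frame_size bytes starting with the record."""
    if frame_size < HEADER.size:
        raise ValueError("frame too small for the record")
    header = HEADER.pack(client_no, frame_size, seq_no)
    return header + b"\x00" * (frame_size - HEADER.size)


def validate_frame(frame, expected_client, expected_seq):
    """Return (ok, gap).

    ok  is True if the frame belongs to the expected sender and the
        received length equals the frame size stored in the record.
    gap is the number of send/receive cycles skipped before this frame
        (0 means no gap).
    """
    if len(frame) < HEADER.size:
        return False, 0
    client_no, frame_size, seq_no = HEADER.unpack_from(frame)
    ok = client_no == expected_client and frame_size == len(frame)
    return ok, seq_no - expected_seq
```

For example, a receiver that expects sequence number 2 from client 3 but gets a frame with sequence number 5 obtains a gap of 3, and a truncated frame fails the length comparison.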

All those recorded values formed the basis for our 
evaluation. 

2.1 Test series 

We composed three test series based on the test parameter
F (refer to table 1):

1. Plain forwarding: this test series only makes use of
netfilter's forwarding capabilities. This means that the
gateway component is instructed to forward all data
transmissions between the client and the server component
without any limitations, i.e. no netfilter rules were
inserted.

2. Simple up- and download rules: this test series is like
the first one, but the gateway component inserts an upload
and a download netfilter rule per client thread. The rules
simply check the IP addresses and the protocol to test^2.
In total 2 • n rules are active for a specific experiment.
In addition, netfilter is instructed to discard any other
data packet that does not conform to the inserted rules.
This is done by setting the policy of the specific rule
table to drop anything that was not matched by any
existing rule.


^2 To be more precise: “iptables -A FORWARD -s <client>
-d <server> -p <protocol> -j ACCEPT” for the upload
and vice versa for the download direction.


3. Simple up- and download rules as well as QoS marks:
this test series does the same as the second one but
additionally inserts netfilter rules per client thread
that tag incoming and outgoing network data packets with a
QoS mark^3. Those marks can be used within the iproute2
utility collection to manipulate the QoS subsystem of the
Linux kernel. In total 4 • n netfilter rules are inserted
for a specific experiment. Please note that the QoS
subsystems of all three test machines were not modified
and used the default (pfifo_fast, a simple packet
first-in-first-out queue with almost no overhead).

The results of the first test series served as a baseline
for the other two. During the experiments, hardware
metrics such as CPU and main memory usage were recorded.
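
How the rule sets of the three test series grow with the number of clients can be sketched as follows. This is an illustration only, not the gateway component's code: the function name build_rules is hypothetical, and it assumes that every client thread has its own source address and that the QoS mark value equals the client index. The command templates follow the two footnotes on the upload rules; for IPv6 the ip6tables command would take the place of iptables.

```python
def build_rules(clients, server, protocol, rules_per_client):
    """Return iptables command lines for one experiment (parameter F)."""
    if rules_per_client not in (0, 2, 4):
        raise ValueError("F must be 0, 2 or 4")
    commands = []
    if rules_per_client == 0:
        return commands  # plain forwarding: no rules are inserted
    for index, client in enumerate(clients, start=1):
        # Upload and download ACCEPT rules (test series 2 and 3).
        commands.append(
            f"iptables -A FORWARD -s {client} -d {server} "
            f"-p {protocol} -j ACCEPT")
        commands.append(
            f"iptables -A FORWARD -s {server} -d {client} "
            f"-p {protocol} -j ACCEPT")
        if rules_per_client == 4:
            # QoS marks (test series 3); mark value is an assumption.
            commands.append(
                f"iptables -t mangle -A PREROUTING -s {client} -d {server} "
                f"-p {protocol} -j MARK --set-mark {index}")
            commands.append(
                f"iptables -t mangle -A PREROUTING -s {server} -d {client} "
                f"-p {protocol} -j MARK --set-mark {index}")
    # Chain policy: drop everything that no rule matched.
    commands.append("iptables -P FORWARD DROP")
    return commands
```

For n clients the sketch returns 2 • n or 4 • n rule commands plus one policy command, which matches the rule counts given for the second and third test series.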

2.2 Test machines characteristics 

The machine for the client component has two AMD
Opteron 870 CPUs with 4 cores each and 2 GHz frequency.
The machines for the gateway and server component have
two AMD Opteron 890 CPUs with 4 cores each and 2.8 GHz
frequency. Each of the three machines has 32 GByte of
main memory (DDR2, ECC error correction).

All test machines used a recent GNU/Linux distribution
(Ubuntu 14.04 in the 64-bit server edition) as the
operating system with a recent Linux kernel (3.13-03).
All unnecessary services were turned off.

2.3 Network configuration 

Each of the three machines used for the tests has a
4-port network adapter with two Intel 82546EB chipsets.
This allows four physical GBit connections. The gateway
machine is dual-homed with one physical GBit connection
each to the client and the server machine. The client and
the server machine each have their own IP network.

Since the gateway machine is dual-homed, it can connect
both networks and uses netfilter to route, to filter and
to transform the data transmissions between the two
networks.

All settings that were available for the tested transport
protocols and address families were left at their
defaults. Although tuning them offers the potential to
raise the processing performance, the complexity in
conjunction with our test parameters was too high.


^3 The rule template for this is “iptables -t mangle -A
PREROUTING -s <client> -d <server> -p <protocol>
-j MARK --set-mark <QoSmark>” for the upload and vice
versa for the download direction.
3. DISCUSSION OF THE TEST RESULTS

We executed every test series three times, and all results
shown use the mean value; the variance was uniformly
low. In total we executed 3,780 single experiments.
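
The number of experiments can be reproduced from the tested values in table 1. The enumeration below is a plausible reconstruction rather than the authors' own bookkeeping: it assumes that the choice between threads and epoll() applies only to the stream-oriented protocols TCP and SCTP, so that UDP contributes a single server mode. The mode labels are illustrative names.

```python
from itertools import product

client_counts = (5, 10, 20, 40, 80, 160, 320)        # parameter n
frame_sizes = (64, 128, 256, 512, 1024, "ranged")    # parameter f
address_families = ("IPv4", "IPv6")                  # parameter A
rules_per_client = (0, 2, 4)                         # parameter F
# Parameters P and T combined: threads vs. epoll() only for stream protocols.
protocol_modes = (
    ("TCP", "threads"), ("TCP", "epoll"),
    ("SCTP", "threads"), ("SCTP", "epoll"),
    ("UDP", "single"),
)

combinations = list(product(client_counts, frame_sizes, address_families,
                            rules_per_client, protocol_modes))
repetitions = 3  # every test series was executed three times
total_experiments = len(combinations) * repetitions

print(len(combinations), total_experiments)  # 1260 3780
```

Under this assumption the product is 7 • 6 • 2 • 3 • 5 = 1,260 parameter combinations, and three repetitions give the 3,780 single experiments.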

3.1 General observations 

All three test series gave us a first impression of the
throughput rate for the tested protocols and address
families. The average throughput rate for all 3,780
experiments is depicted in figures 3 and 4. Both show the
results for the tested address families, scaled to the
potential transmission maximum of 1 GBit per second.

Figure 3 shows the throughput rate grouped by the tested
number of concurrent client threads. This way it is
possible to estimate the average throughput for any
application where the number of clients is known. Please
note that the throughput rates shown already include the
decrease caused by netfilter's filtering and routing. As
visible in figure 3, the average throughput rate is quite
stable but decreases with a higher number of concurrent
clients. The only exception is SCTP, where the throughput
rate is surprisingly higher for 320 concurrent clients
than for 80 and 160.

Figure 4 shows the throughput rate grouped by the tested
frame sizes. This figure also includes all experiments
where netfilter rules were involved. Unsurprisingly, the
throughput rate increases with a higher frame size. The
general case is shown in the last bar group labeled
“ranged”. In this case the frame size was randomly
chosen^4 in a range between 64 and 1024 before every
send/receive cycle in every client thread.

We can confirm the widely known fact that, in terms of
throughput, SCTP is slower than TCP, which in turn is
slower than UDP. Our results show that SCTP is on average
32.65 percent slower than TCP (minimum/maximum
difference: 9.28 and 48.23 percent) for all experiments.
TCP in turn is on average 8.42 percent slower than UDP
(minimum/maximum difference: 5.18 and 10.64 percent).

Our test results also showed that the throughput rate
for IPv6 is noticeably lower than for IPv4. All tested
protocols are on average 9.22 percent faster with IPv4
than with IPv6 (minimum/maximum difference: 4.59
and 13.8 percent).

As mentioned before, we recorded the available hardware
usage statistics during all experiments. Compared to the
statistics for our router machine in its idle state, the
impact of netfilter routing during the experiments is
marginal on average. However, this depends on the network
adapters and system drivers in use. Our network adapters
feature a special network processor that massively reduces
the CPU load by validating incoming network data packets
natively, e.g. calculating checksums and verifying packet
headers, which is otherwise done by the operating system.

^4 A Mersenne random number generator was used.

Figure 3: Average throughput rate per client for the
tested number of clients

Figure 4: Average throughput rate per client for the
tested frame sizes

3.2 Impact of netfilter 

As stated in the previous section, the first test series 
(without any netfilter rules) served us as a baseline for 
the other two that we executed (with different numbers 
of netfilter rules). 


For the second and third test series, we calculated the
difference from the first one. The results showed a
decrease of 2.25 percent on average for all experiments
where netfilter rules were involved. We summarized the
average throughput decrease in figures 5 (IPv4) and 6
(IPv6). These figures show the average throughput decrease
grouped by the tested client thread numbers and
additionally for every tested protocol and number of
active netfilter rules per client.

As depicted in figures 5 and 6, the decrease differs
between the tested address families: the decrease for
IPv6 is lower than for IPv4 (1.79 vs. 2.71 percent on
average). By considering the decrease percentages as a
function of the number of inserted netfilter rules, we
calculated the gradient for each tested protocol and
address family. On average the gradients are nearly
constant. This can barely be seen in figures 5 and 6
because the x-axis is not linearly scaled. To confirm the
nearly constant impact of netfilter on the throughput
rate, we reviewed our test results with respect to the
number of routed data packets between the client and
server thread(s) rather than the throughput rate. The
review also confirms our main findings:

1. netfilter's performance in terms of throughput is
independent of the transport protocol, frame size and
address family used as long as simple netfilter rules
are active

2. the throughput loss increases roughly linearly with
the number of inserted (simple) netfilter rules, although
this loss is quite insignificant

The second main finding shown above allowed us to express
the throughput loss per netfilter rule: one can assume a
throughput loss of 0.05 percent for any (simple) IPv4 rule
and 0.03 percent for any (simple) IPv6 rule.
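
Read literally, the per-rule figures give a simple linear estimate of the expected loss. The sketch below only applies that arithmetic; the function names are hypothetical, and the text does not state over which range of rule counts the linear model holds, so results for large rule sets should be treated as extrapolation.

```python
# Per-rule throughput loss in percent, as stated in the text.
LOSS_PER_RULE_PERCENT = {"IPv4": 0.05, "IPv6": 0.03}


def estimated_loss_percent(rule_count, address_family):
    """Linear estimate: loss grows proportionally with the rule count."""
    return LOSS_PER_RULE_PERCENT[address_family] * rule_count


def remaining_throughput(baseline, rule_count, address_family):
    """Apply the estimated loss to a baseline throughput (any unit)."""
    loss = min(estimated_loss_percent(rule_count, address_family), 100.0)
    return baseline * (1.0 - loss / 100.0)
```

For instance, 40 simple IPv4 rules yield an estimated loss of 40 • 0.05 = 2 percent, so a baseline of 100 MBit per second would drop to about 98 MBit per second.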

3.3 Client handling techniques 

The last objective of this study was to evaluate the
client handling techniques in a client-server
application: the server component was instructed to handle
the data transmissions of stream-oriented clients either
in a separate thread per client or in a single thread using
epoll().

To compare the throughput of those two client handling
techniques, we only used the experiment results of the
first test series and only for SCTP and TCP, but for both
address families. The average difference is illustrated in
figure 7 as a percentage between the threaded and
unthreaded technique.

This figure clearly indicates that there is a turning
point in the number of client connections handled by a
server process at which the technique offering the higher
throughput rate changes.
Figure 5: Average throughput decrease per
client for IPv4 when netfilter rules are active
(u/d = up- and download)

Figure 6: Average throughput decrease per
client for IPv6 when netfilter rules are active
(u/d = up- and download)


In our experiments this turning point was around 40
concurrent client connections. According to its technical
specifications, the test machine executing the server
component natively handles 32 concurrent threads.
Figure 7: Average throughput difference between client
handling with multiple threads and within a single thread
(x-axis: number of concurrent clients)


This led us to examine the system usage statistics that
were recorded during the experiments. We noticed a
significant increase in the number of context switches
for our experiments with more than 40 concurrent threads.
A context switch takes place when the operating system
saves the current state of a process or thread for later
execution in favor of executing another process or
thread. This storing and restoring of contexts is quite
expensive in terms of computation time and can cause the
system to slow down. In contrast, the same experiments
with 40 or more client connections that were handled via
epoll() in a single thread did not show this impact.

In summary, we recommend using the epoll() facility of
the Linux kernel in client-server architectures in
general. The reason is its better scalability compared to
client handling with threads for a higher number of
client connections. Although the throughput rate is higher
when threads are used for a small number of clients, the
difference to handling with epoll() is not significant. In
addition, an application using epoll() spares the
operating system unnecessary context switches that also
affect other concurrent applications.
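
The single-threaded handling technique can be sketched with a minimal echo server. This is not the authors' server component: it uses Python's selectors module, whose DefaultSelector is backed by epoll on Linux, and the function names make_listener and poll_once are hypothetical. As a simplification, the echo is written with sendall() on a non-blocking socket, which is adequate for the small frames used here but would need write buffering in a robust server.

```python
import selectors
import socket


def make_listener(host="127.0.0.1", port=0):
    """Open, bind and listen on a non-blocking TCP server socket."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen()
    srv.setblocking(False)
    return srv


def poll_once(sel, listener, timeout=0.5):
    """Handle every socket that is ready; return the number of events."""
    events = sel.select(timeout)
    for key, _mask in events:
        sock = key.fileobj
        if sock is listener:
            # New client: register it with the same selector.
            conn, _addr = listener.accept()
            conn.setblocking(False)
            sel.register(conn, selectors.EVENT_READ)
        else:
            data = sock.recv(1024)
            if data:
                sock.sendall(data)   # echo the frame back to the client
            else:
                sel.unregister(sock)  # peer closed the connection
                sock.close()
    return len(events)
```

A server loop would register the listener once with sel.register(listener, selectors.EVENT_READ) and then call poll_once() repeatedly; all client connections are served by that one thread, so no per-client thread has to be scheduled.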

4. RELATED WORK 

In March 2000, the netfilter routing subsystem was merged
into the Linux kernel as the successor of the former
subsystem ipchains. The first performance evaluations
of this new subsystem were made by Hartmeier et al. and
Podey et al. A comparison of their throughput results with
ours for a high number of netfilter rules indicates the
same correlations but also illustrates the improvements in
the Linux kernel and the netfilter subsystem since then.

Further publications dealt with the architecture of
netfilter in order to raise its performance. The netfilter
rules are organized in tables that are consulted according
to the state of a network data packet. In general, rule
evaluation is done sequentially in each table. Lyu et al.
as well as Fulp classified rules for a later elimination
of unnecessary rules. This decreased the overall effort to
inspect a data packet within netfilter and led to a better
throughput.

In addition, user-defined sub-tables can be created in
each of netfilter's pre-defined tables and can be used as
a target for a rule. This allows the segmentation of the
rule evaluation. Fulp et al. showed that the rules can be
organized as a trie to achieve a faster rule evaluation.

Acharya et al. collected real-world firewall rule sets of
tier-1 internet service providers and the associated usage
statistics to form a model for analysis. This model was
later used to improve the rule sets in order to increase
the throughput.

Accardi et al. used a special expansion card with a
programmable network processor, in combination with a
netfilter module written for this purpose, to offload the
inspection of network data packets. Their results show a
tremendous increase in packet processing in a worst-case
scenario, e.g. a denial-of-service attack. In this attack
scenario, the netfilter router faces a massive amount of
invalid packets. Accardi et al. demonstrated that their
setup of programmable network processor and corresponding
netfilter module can prevent the effects of a
denial-of-service attack.

5. CONCLUSION AND FUTURE WORK 

In this study we presented the results of our experiments
studying the impact of netfilter on the throughput rate.
We tested different combinations of transport protocols,
address families and frame sizes for an increasing number
of netfilter rules. In summary, we found that the
throughput loss does not depend on those parameters. The
throughput loss is also quite insignificant and rises
roughly linearly with the number of rules. Our experiments
showed an average throughput loss of 0.05 percent for any
(simple) IPv4 rule and 0.03 percent for any (simple) IPv6
rule.

In addition, we evaluated two prominent client handling
strategies for the server component in a client-server
application. We showed that, up to a certain point,
client handling with threads offers a higher but only
slight performance gain compared to its counterpart using
the epoll() facility. Beyond this point the thread
management is too expensive in terms of computation time
and causes the throughput rate per thread to degrade. The
epoll() facility, in contrast, does not show this
behaviour.

With the introduction of nftables as the designated
successor of the current iptables, a performance gain is
expected (although it is based on netfilter, too). As soon
as nftables becomes stable, we will redo our experiments
using this tool.